Full-fledged Real-Time Indexing for Constant Size Alphabets

Author: Kucherov Gregory
Nekrich Yakov
Publication date: 06/07/2013

In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text $T$ over an alphabet of constant size. Each new symbol can be prepended to $T$ in $O(1)$ worst-case time. At any moment, we can report all occurrences of a pattern $P$ in the current text in $O(|P|+k)$ time, where $|P|$ is the length of $P$ and $k$ is the number of occurrences. This resolves, under the assumption of a constant-size alphabet, the long-standing open problem of whether a real-time indexing method for string matching exists (see \cite{AmirN08}).
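To make the query interface in this abstract concrete, here is a minimal Python sketch. It is not the data structure from the paper: the class name `NaivePrependIndex` and its methods are hypothetical, and queries use a plain scan rather than running in $O(|P|+k)$ time. It only illustrates the two operations, prepending a symbol and reporting all occurrences of a pattern.

```python
class NaivePrependIndex:
    """Toy stand-in for a real-time index: same operations, much slower queries."""

    def __init__(self):
        # The text is stored reversed, so prepending a symbol is a list append.
        self._rev = []

    def prepend(self, symbol):
        """Add one symbol in front of the current text."""
        self._rev.append(symbol)

    def text(self):
        """Return the current text in reading order."""
        return "".join(reversed(self._rev))

    def occurrences(self, pattern):
        """Return all start positions of pattern in the current text (naive scan)."""
        t = self.text()
        m = len(pattern)
        return [i for i in range(len(t) - m + 1) if t[i:i + m] == pattern]


# Build "abracadabra" by prepending its symbols from last to first.
index = NaivePrependIndex()
for c in reversed("abracadabra"):
    index.prepend(c)

print(index.occurrences("abra"))  # start positions of "abra"
```

The point of the paper's result is that both operations can be made fast at once; the sketch trades that away for brevity.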

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Reconsidering the significance of genomic word frequency

Author: Csűrös Miklós
Kucherov Gregory
Noé Laurent
Publication date: 14/09/2006

We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-lognormal distribution, which explains lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes. The parametrization of genomic word frequencies allows for an assessment of significance for frequent or rare sequence motifs
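The quantity being modelled here is how often each DNA word occurs. As a rough illustration, the Python sketch below counts overlapping words of length `k` and then tabulates how many distinct words share each occurrence count. The function names and the toy sequence are assumptions made for this example, and fitting a double Pareto-lognormal distribution to the resulting spectrum is not shown.

```python
from collections import Counter


def word_counts(sequence, k):
    """Count every overlapping word (k-mer) of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def frequency_spectrum(counts):
    """Map each occurrence count c to the number of distinct words seen c times."""
    return Counter(counts.values())


seq = "ACGTACGTAC"  # toy sequence, not real genomic data
counts = word_counts(seq, 2)
spectrum = frequency_spectrum(counts)
print(counts)
print(spectrum)
```

On a real genome the spectrum would be computed for larger `k` and compared with the fitted distribution.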

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

Estimating seed sensitivity on homogeneous alignments

Author: Kucherov Gregory
Noe Laurent
Ponty Yann
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2003

We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition
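For readers unfamiliar with seed sensitivity, the following Python sketch shows what is being estimated. It assumes the simplest possible setting, an independent (Bernoulli) match probability `p` per alignment column, which is not the homogeneous-alignment model studied in the paper. The seed notation (`#` for a required match, `-` for a don't-care position) and the function names are also assumptions of this example. Sensitivity is computed by brute-force enumeration, so it is only practical for short alignments.

```python
from itertools import product


def seed_hits(seed, alignment):
    """True if the seed fits somewhere: every '#' must face a match (1)."""
    span = len(seed)
    return any(
        all(alignment[i + j] == 1 for j, c in enumerate(seed) if c == "#")
        for i in range(len(alignment) - span + 1)
    )


def sensitivity(seed, length, p):
    """Exact hit probability over all 2**length alignments, match probability p."""
    total = 0.0
    for alignment in product((0, 1), repeat=length):
        if seed_hits(seed, alignment):
            matches = sum(alignment)
            total += p ** matches * (1 - p) ** (length - matches)
    return total


print(sensitivity("##", 3, 0.5))
print(sensitivity("#-#", 3, 0.5))
```

Restricting the enumeration to a particular class of alignments would change the resulting probabilities, which is the kind of bias the abstract refers to.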

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

RNF: a general framework to evaluate NGS read mappers

Author: Boeva Valentina
Břinda Karel
Kucherov Gregory
Publication venue: 'Oxford University Press (OUP)'
Publication date: 02/04/2015

Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. As a consequence, the sensitivity and precision of the mapping tool, applied with certain parameters to certain data, can critically affect the accuracy of produced results (e.g., in variant calling applications). Therefore, there has been an increasing demand for methods for comparing mappers and for measuring the effects of their parameters. Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. To overcome this obstacle, we have created a generic format RNF (Read Naming Format) for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package RNF containing two principal components. MIShmash applies one of several popular read simulation tools (among DwgSim, Art, Mason, CuReSim etc.) and transforms the generated reads into RNF format. LAVEnder then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination.

arXiv.org e-Print Archive

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Linear pattern matching on sparse suffix trees

Author: Kolpakov Roman
Kucherov Gregory
Starikovskaya Tatiana
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/03/2011

Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a {\em sparse suffix tree} \cite{KU-96} with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_{\sigma}n$ characters ($\sigma$ being the alphabet size), our index takes $O(n/\log_{\sigma}n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
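The packing step that this index builds on can be sketched in a few lines of Python. This shows only the string packing, not the sparse suffix tree or its suffix links. The function names, the base-$\sigma$ encoding, and the choice of `r = 4` characters per word are assumptions for illustration.

```python
def pack(text, alphabet, r):
    """Pack text into integers holding r characters each (the last may hold fewer)."""
    sigma = len(alphabet)
    code = {c: i for i, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(text), r):
        value = 0
        for c in text[start:start + r]:
            # Treat each block as a base-sigma number, first character most significant.
            value = value * sigma + code[c]
        words.append(value)
    return words


def unpack(words, alphabet, r, n):
    """Recover the original text of length n from its packed words."""
    sigma = len(alphabet)
    chars = []
    for w_index, value in enumerate(words):
        count = min(r, n - w_index * r)  # the last word may be shorter
        block = []
        for _ in range(count):
            value, digit = divmod(value, sigma)
            block.append(alphabet[digit])
        chars.extend(reversed(block))
    return "".join(chars)


text = "ACGTTGCAAC"
words = pack(text, "ACGT", 4)
print(words)
print(unpack(words, "ACGT", 4, len(text)))
```

A string of length $n$ then occupies about $n/r$ words, which matches the space bound in the abstract when $r$ is $\log_{\sigma}n$.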

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

On the combinatorics of suffix arrays

Author: Kucherov Gregory
Tóthmérész Lilla
Vialette Stéphane
Publication date: 18/06/2012

We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], which we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the characterization of suffix arrays for the special case of a binary alphabet given in [2] easily follows from our characterization. Based on our results, we also provide simple proofs for the enumeration results for suffix arrays obtained in [3]. Our approach to characterizing suffix arrays is the first that exploits their relationship with Burrows-Wheeler permutations.
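The reduction from suffix sorting to cyclic shift sorting mentioned in this abstract can be checked on a small example. The Python sketch below uses naive sorting and hypothetical function names. It assumes the sentinel `$` does not occur in the text and compares smaller than every text symbol. It illustrates the reduction only and does not reproduce the paper's characterization.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def cyclic_shift_order(text):
    """Start positions of the cyclic shifts of text in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:] + text[:i])


def suffix_array_via_shifts(text, sentinel="$"):
    """Sort cyclic shifts of text + sentinel, then drop the shift starting at the sentinel."""
    order = cyclic_shift_order(text + sentinel)
    return [i for i in order if i != len(text)]


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: the symbol preceding each sorted cyclic shift."""
    s = text + sentinel
    return "".join(s[i - 1] for i in cyclic_shift_order(s))


print(suffix_array("banana"))
print(suffix_array_via_shifts("banana"))
print(bwt("banana"))
```

Because the sentinel is unique and smallest, comparing two cyclic shifts is decided no later than the sentinel position, so the shift order agrees with the suffix order.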

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM